A Data Augmentation Method for English-Vietnamese Neural Machine Translation
Abstract
The translation quality of machine translation systems depends on the parallel corpus used for training, in particular on the quantity and quality of the corpus. However, building a high-quality, large-scale parallel corpus is complex and expensive, particularly for a specific domain. Therefore, data augmentation techniques are widely used in machine translation. The input of the back-translation method is monolingual text, which is available from many sources; this method can therefore be implemented easily and effectively to generate synthetic parallel data. In practice, monolingual texts collected from different sources, such as websites, often have errors in grammar and spelling, sentence mismatch, or freestyle writing. The quality of the output translation is then reduced, leading to a low-quality parallel corpus generated by back-translation. In this study, we propose a method for improving the quality of the monolingual texts used for back-translation. Moreover, the data are supplemented by pruning the translation table. We experimented with an English-Vietnamese neural machine translation system using the IWSLT2015 dataset for training and testing in the legal domain. The results showed that the proposed method can effectively augment parallel data for machine translation, thereby improving translation quality. In our experimental cases, the BLEU score increased by 16.37 points compared with the baseline system.
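The back-translation flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how the monolingual text is cleaned, so the `looks_clean` filter, its thresholds, and the `reverse_translate` callable (standing in for a trained target-to-source model) are all hypothetical names and assumptions introduced here for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def looks_clean(sentence: str, min_tokens: int = 3, max_tokens: int = 80,
                min_alpha_ratio: float = 0.7) -> bool:
    """Hypothetical noise filter (an assumption, not from the paper):
    reject lines that are too short, too long, or dominated by
    non-alphabetic characters."""
    tokens = sentence.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    chars = [c for c in sentence if not c.isspace()]
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_ratio


def back_translate(target_sentences: Iterable[str],
                   reverse_translate: Callable[[str], str]
                   ) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side
    monolingual text. `reverse_translate` is a placeholder for a
    target-to-source translation model."""
    pairs = []
    for tgt in target_sentences:
        tgt = " ".join(tgt.split())      # normalize whitespace
        if not looks_clean(tgt):         # drop noisy lines before translating
            continue
        src = reverse_translate(tgt)     # synthetic source sentence
        if src.strip():
            pairs.append((src, tgt))     # the target side stays human-written
    return pairs
```

In use, the returned pairs would be added to the genuine parallel corpus before retraining the forward model. Filtering before translation, rather than after, reflects the abstract's point that noisy input text lowers the quality of the synthetic corpus.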
Similar resources
Data Augmentation for Low-Resource Neural Machine Translation
The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, syn...
Generation of Vietnamese for French-Vietnamese and English-Vietnamese Machine Translation
This paper presents the implementation of the Vietnamese generation module in ITS3, a multilingual machine translation (MT) system based on the Government & Binding (GB) theory. Despite well-designed generic mechanisms of the system, it turned out that the task of generating Vietnamese posed non-trivial problems. We therefore had to deviate from the generic code and make new design and implemen...
Pivoting Methods and Data for Czech-Vietnamese Translation via English
The statistical approach to machine translation (MT) relies heavily on large parallel corpora. For many language pairs, this can be a significant obstacle. A promising alternative is pivoting, i.e. making use of a third language to support the translation. There are a number of pivoting methods, but unfortunately, they were not evaluated in comparable settings. We focus on one particular langua...
Building A Training Corpus For Word Sense Disambiguation In English-To-Vietnamese Machine Translation
The most difficult task in machine translation is the elimination of ambiguity in human languages. A certain word in English as well as Vietnamese often has different meanings which depend on their syntactical position in the sentence and the actual context. In order to solve this ambiguation, formerly, people used to resort to many hand-coded rules. Nevertheless, manually building these rules ...
Dynamic Data Selection for Neural Machine Translation
Intelligent selection of training data has proven a successful technique to simultaneously increase training efficiency and translation performance for phrase-based machine translation (PBMT). With the recent increase in popularity of neural machine translation (NMT), we explore in this paper to what extent and how NMT can also benefit from data selection. While state-of-the-art data selection ...
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2023.3252898